Fitting German into N-Gram Language Models
نویسندگان
چکیده
We report on a series of experiments addressing the fact that German is less suited than English for word-based n-gram language models. Several systems were trained at different vocabulary sizes using various sets of lexical units. They were evaluated against a newly created corpus of German and Austrian broadcast news.
منابع مشابه
Hierarchical Bayesian Language Modelling for the Linguistically Informed
In this work I address the challenge of augmenting n-gram language models according to prior linguistic intuitions. I argue that the family of hierarchical Pitman-Yor language models is an attractive vehicle through which to address the problem, and demonstrate the approach by proposing a model for German compounds. In an empirical evaluation, the model outperforms the Kneser-Ney model in terms...
متن کاملFitting long-range information using interpolated distanced n-grams and cache models into a latent dirichlet language model for speech recognition
We propose a language modeling (LM) approach using interpolated distanced n-grams into a latent Dirichlet language model (LDLM) [1] for speech recognition. The LDLM relaxes the bag-of-words assumption and document topic extraction of latent Dirichlet allocation (LDA). It uses default background ngrams where topic information is extracted from the (n-1) history words through Dirichlet distributi...
متن کاملWider Context by Using Bilingual Language Models in Machine Translation
In past Evaluations for Machine Translation of European Languages, it could be shown that the translation performance of SMT systems can be increased by integrating a bilingual language model into a phrase-based SMT system. In the bilingual language model, target words with their aligned source words build the tokens of an n-gram based language model. We analyzed the effect of bilingual languag...
متن کاملWmt ’ 13
This paper describes LIMSI’s submissions to the shared WMT’13 translation task. We report results for French-English, German-English and Spanish-English in both directions. Our submissions use n-code, an open source system based on bilingual n-grams, and continuous space models in a post-processing step. The main novelties of this year’s participation are the following: our first participation ...
متن کاملNTT - NAIST Syntax - based SMT Systems for IWSLT 2014
This paper presents NTT-NAIST SMT systems for English-German and German-English MT tasks of the IWSLT 2014 evaluation campaign. The systems are based on generalized minimum Bayes risk system combination of three SMT systems using the forest-to-string, syntactic preordering, and phrase-based translation formalisms. Individual systems employ training data selection for domain adaptation, truecasi...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2002